Day 20: Building Your Own Translation Program with a Transformer (Part 2): Setting Up the Environment and Downloading the Dataset

2021 iThome Ironman Contest

DAY 20

AI & Data

What Is Attention Actually Paying Attention To? Part 20 of the series

13th Ironman Contest

guancioul

2021-09-20 16:53:51


Introduction

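I will start by implementing a Portuguese-to-English translation model. Once I have settled on a good Chinese-to-English dataset, I will write a separate tutorial for it.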

Setting Up the Environment

!pip install tensorflow_datasets
!pip install -U tensorflow-text

This program needs tensorflow_datasets installed and tensorflow-text upgraded to the latest version (-U is short for --upgrade).

import collections
import logging
import os
import pathlib
import re
import string
import sys
import time

import numpy as np
import matplotlib.pyplot as plt

import tensorflow_datasets as tfds
import tensorflow_text as text
import tensorflow as tf

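Import these packages. If an import fails, the error message tells you which package was not installed correctly. Day 16, "Preparing to Implement Self-Attention (Part 2): Setting Up the TensorFlow and Keras Environment", explains how to install TensorFlow and check its version.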
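As a quick pre-check before running the imports, the sketch below lists which of the required top-level packages Python cannot locate. The helper name `find_missing_modules` is hypothetical and not part of the original tutorial. It relies on `importlib.util.find_spec`, which only looks a module up without importing it, so it will not catch a package that is installed but broken (for example, a tensorflow-text build that does not match the installed TensorFlow).

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the names in `module_names` that Python cannot locate."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Top-level third-party packages used by the imports above.
required = ["numpy", "matplotlib", "tensorflow_datasets",
            "tensorflow_text", "tensorflow"]

missing = find_missing_modules(required)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages were found.")
```

Anything reported as missing can then be installed with pip in the same way as the two packages above.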

logging.getLogger('tensorflow').setLevel(logging.ERROR)  # suppress warnings

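According to the logging documentation,
logging.getLogger('tensorflow') returns the logger used by the tensorflow package, and setLevel sets a threshold: only messages at or above the given level are shown.

So setLevel(logging.ERROR) means that only messages at the ERROR level or higher are displayed; all lower-level logs, such as warnings, are suppressed.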
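To make the threshold behavior concrete, here is a small sketch using only Python's standard logging module. It uses a stand-in logger named "demo" (an assumption for illustration, so that it runs without TensorFlow); the tutorial applies the same setLevel call to the 'tensorflow' logger.

```python
import logging

# Stand-in logger; the tutorial applies the same call to the 'tensorflow' logger.
demo_logger = logging.getLogger("demo")
demo_logger.setLevel(logging.ERROR)

# Walk through the standard levels from lowest to highest severity and
# report whether a message at that level would pass the ERROR threshold.
for name in ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"):
    level = getattr(logging, name)
    print(f"{name:<8} {level:>2}  shown: {demo_logger.isEnabledFor(level)}")
```

Only ERROR and CRITICAL pass the threshold, which is why TensorFlow's warnings no longer appear after the call.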

Downloading the Dataset

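Use TensorFlow Datasets to download the Portuguese-to-English translation dataset.

The dataset contains roughly 50,000 training examples, 1,100 validation examples, and 2,000 test examples.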

examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True,
                               as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']

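with_info=True makes tfds.load return a tuple of the dataset and its DatasetInfo object.
as_supervised=True makes each example come back already split into an (input, label) pair, here (Portuguese sentence, English sentence).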
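The DatasetInfo object returned by with_info=True can be used to check the split sizes quoted above rather than taking them on trust. The sketch below is one way to do that, assuming the object exposes a `splits` mapping whose values have a `num_examples` attribute, as DatasetInfo does. The helper name `split_sizes` and the stand-in object with placeholder counts are hypothetical; they exist only so the sketch runs without downloading anything.

```python
from types import SimpleNamespace


def split_sizes(info):
    """Map each split name to its example count for a DatasetInfo-like object."""
    return {name: split.num_examples for name, split in info.splits.items()}


# Stand-in object with placeholder counts (not the real dataset sizes).
# In the tutorial, pass the real `metadata` returned by tfds.load instead.
fake_info = SimpleNamespace(
    splits={
        "train": SimpleNamespace(num_examples=3),
        "validation": SimpleNamespace(num_examples=1),
    }
)
print(split_sizes(fake_info))
```

With the real dataset loaded, calling split_sizes(metadata) reports the exact number of examples in each split.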

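# Group the training pairs into batches of 3 and look at the first batch only.
for pt_examples, en_examples in train_examples.batch(3).take(1):
  # Each element is a byte string, so decode it to text before printing.
  for pt in pt_examples.numpy():
    print(pt.decode('utf-8'))

  print()

  for en in en_examples.numpy():
    print(en.decode('utf-8'))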

e quando melhoramos a procura , tiramos a única vantagem da impressão , que é a serendipidade .
mas e se estes fatores fossem ativos ?
mas eles não tinham a curiosidade de me testar .

and when you improve searchability , you actually take away the one advantage of print , which is serendipity .
but what if it were active ?
but they did n't test for curiosity .

These lines print three examples from the dataset: the Portuguese sentences first, followed by their English translations.
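For readers unfamiliar with batch and take, the following plain-Python analogy mimics what batch(3).take(1) does in the loop above. It is not the tf.data implementation: the `batch` generator and the four sentence pairs are made up for illustration, and the pairs are stored as UTF-8 bytes to mirror why the tutorial calls decode('utf-8').

```python
from itertools import islice


def batch(items, size):
    """Yield consecutive lists of up to `size` items, similar to Dataset.batch(size)."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Made-up (pt, en) pairs stored as UTF-8 bytes, like the values from .numpy().
pairs = [
    ("olá".encode("utf-8"), b"hello"),
    ("obrigado".encode("utf-8"), b"thank you"),
    ("sim".encode("utf-8"), b"yes"),
    ("não".encode("utf-8"), b"no"),
]

# batch(pairs, 3) groups the pairs; islice(..., 1) keeps only the first
# group, which plays the role of take(1).
first_batches = list(islice(batch(pairs, 3), 1))

for group in first_batches:
    pt_examples = [pt for pt, _ in group]
    en_examples = [en for _, en in group]
    for pt in pt_examples:
        print(pt.decode("utf-8"))
    print()
    for en in en_examples:
        print(en.decode("utf-8"))
```

The fourth pair never appears because only the first batch of three is taken, just as the tutorial's loop prints three sentence pairs and stops.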